Skip to content

feat(redteam): add built-in red teaming support#184

Merged
poshinchen merged 1 commit into
strands-agents:mainfrom
kevmyung:feat/red-team-foundation
Jun 2, 2026
Merged

feat(redteam): add built-in red teaming support#184
poshinchen merged 1 commit into
strands-agents:mainfrom
kevmyung:feat/red-team-foundation

Conversation

@kevmyung

@kevmyung kevmyung commented Mar 31, 2026

Copy link
Copy Markdown
Contributor

Description

Adds an experimental red-teaming module (strands_evals.experimental.redteam) that lets users run multi-turn adversarial attacks against a target agent and score whether safety guardrails hold. The module composes existing Strands Evals primitives (Case, Experiment, Evaluator, ActorSimulator) rather than introducing a parallel framework.

Two-step flow

1. Generate cases. AdversarialCaseGenerator infers risk categories from the target's system prompt and tools, then generates per-category attack cases via an LLM. Custom cases can be authored directly via RedTeamCase + AttackGoal for domain-specific business rules.

case-generation

2. Run evaluations. RedTeamExperiment drives a multi-turn attacker conversation against the target, captures the full conversation + tool trace, and scores it with an LLM judge.

run-evaluation
  from strands import Agent
  from strands_evals.experimental.redteam import (
      AdversarialCaseGenerator, RedTeamExperiment,
  )

  agent = Agent(system_prompt="You are a customer service agent.", tools=[...])

  cases = AdversarialCaseGenerator(model=model).generate_cases(target=agent)
  report = RedTeamExperiment(cases=cases, target=agent).run_evaluations()
  report.display()

What ships

  • Pipeline: AdversarialCaseGenerator → RedTeamExperiment → AttackSuccessEvaluator → RedTeamReport.
  • Built-in risk categories: guideline_bypass, system_prompt_leak, harmful_content, data_exfiltration, excessive_agency. Auto-inferred from target info when not specified.
  • Strategy: gradual_escalation (prompt-driven) as the default. AttackStrategy ABC + PromptStrategy are in place; algorithmic strategies (Crescendo, PAIR, TAP, etc.) land in follow-up PRs (see design
    doc §2.4).
  • Evaluator: AttackSuccessEvaluator — LLM-as-judge with a 4-anchor continuous rubric (refused / partial / substantial / full, scores 0.0–1.0) over conversation + tool trace.
  • Report: RedTeamReport.display() for console summary, plus attack_results(), by_risk_category(), by_strategy(), failed_cases for programmatic access.
  • Framework-agnostic targets: any Callable[[str], str | dict] works alongside Strands Agent (with optional trace capture via the dict shape).
  • Lives under experimental/ — API may change before promotion.

Related Issues

Closes #220.

Type of Change

New feature (experimental module).

Testing

  • hatch run prepare — 1119 passed.
  • e2e smoke against Bedrock targets : cases generated, multi-turn attacks executed, judge scored anchor points cleanly.
  • Unit tests cover: generator (mocked LLM), task runner, experiment wiring, report aggregation, evaluator prompt assembly, agent adapter, strategy registry contract.

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@poshinchen poshinchen left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use built-in python | / list instead of typing's deprecated Union, List and so on?

@kevmyung kevmyung temporarily deployed to manual-approval May 1, 2026 15:23 — with GitHub Actions Inactive
@kevmyung

kevmyung commented May 1, 2026

Copy link
Copy Markdown
Contributor Author

Could you use built-in python | / list instead of typing's deprecated Union, List and so on?

Quick heads-up – fixed it in 438f9e0

Comment thread src/strands_evals/evaluators/red_team_judge_evaluator.py Outdated
Comment thread src/strands_evals/evaluators/red_team_judge_evaluator.py Outdated
Comment thread src/strands_evals/evaluators/red_team_judge_evaluator.py Outdated
Comment thread src/strands_evals/evaluators/red_team_judge_evaluator.py Outdated
Comment thread src/strands_evals/evaluators/red_team_judge_evaluator.py Outdated
Comment thread src/strands_evals/redteam/evaluators/red_team_judge_evaluator.py Outdated
@yeomjiwonyeom

Copy link
Copy Markdown
Contributor

Created sub-issues under #177 to track P0/P1 work:

This PR (#184) covers the infrastructure layer of P0. Checked items in #220 reflect what's already implemented here:

  • AttackStrategy ABC, RiskCategory, AttackGoal types
  • AttackSuccessEvaluator (0.0–1.0 continuous scoring on execution traces)
  • RedTeamJudgeEvaluator (binary per-metric)
  • red_team(agent) entry point with auto tool extraction
  • RedTeamReport with grouped views
  • Multi-turn conversation loop via ActorSimulator
  • 1 strategy: gradual_escalation

Remaining P0 work (tracked in #220):

  • 5 multi-turn strategies: Crescendo, Linear/PAIR, TAP/TreeJailbreaking, BadLikertJudge, SequentialBreak
  • RedTeamExperiment orchestrator
  • AdversarialActorSimulator / AdversarialCaseGenerator
  • Turn budget increase to 20-50

@kevmyung

Copy link
Copy Markdown
Contributor Author

@poshinchen Resolved your comments in d750fe0:

  • Moved red-team evaluators under src/strands_evals/redteam/evaluators/
  • Unified attack_strategies API (accepts strings or AttackStrategy instances)
  • Fixed extract_tool_info to handle get_all_tools_config's dict shape
  • Addressed remaining nits (typing, unused symbols, dead branches, docstrings)

@poshinchen

Copy link
Copy Markdown
Contributor

/strands review the PR

Comment thread src/strands_evals/redteam/runner.py Outdated
Comment thread src/strands_evals/redteam/runner.py Outdated
Comment thread src/strands_evals/redteam/runner.py Outdated
Comment thread src/strands_evals/redteam/presets.py Outdated
Comment thread src/strands_evals/redteam/evaluators/red_team_judge_evaluator.py Outdated
Comment thread src/strands_evals/redteam/agent_adapter.py Outdated
@github-actions

Copy link
Copy Markdown

Issue: This PR introduces a significant new public API surface (strands_evals.redteam) with multiple abstractions customers will use (red_team(), AttackStrategy, RedTeamJudgeEvaluator, AttackSuccessEvaluator, presets). Per the API Bar Raising guidelines, this likely warrants the needs-api-review label.

The PR description documents the components well, but missing from an API review perspective:

  • Module-level import paths (e.g., can users do from strands_evals import red_team?)
  • Currently the redteam module is not re-exported from strands_evals/__init__.py
  • Whether JAILBREAK, PROMPT_EXTRACTION, HARMFUL_CONTENT should be public constants vs. accessed via the registry

Suggestion: Add needs-api-review label and decide on the import ergonomics. At minimum, consider adding redteam to the top-level __init__.py lazy imports so users can write:

from strands_evals.redteam import red_team

and document whether this is the intended public entry point.

Comment thread tests/strands_evals/redteam/test_runner.py Outdated
Comment thread src/strands_evals/redteam/strategies/base.py Outdated
@github-actions

Copy link
Copy Markdown

Review Summary

Assessment: Comment (Request Changes on specific items)

Solid foundation for red teaming capabilities. The architecture cleanly separates concerns (presets vs. strategies vs. evaluators vs. runner) and the red_team() API provides a simple entry point. Two items I'd ask to address before merging:

Review Categories
  • Concurrency Safety: The shared tool_trace mutable list pattern will silently corrupt data if the experiment runs with parallel workers. This needs either a fix or an explicit max_workers=1 constraint.
  • Test Coverage Gaps: AttackSuccessEvaluator and agent_adapter.py are untested — both are part of the public API surface.
  • API Surface: This introduces a substantial new public module. Consider adding needs-api-review label and clarifying the intended import paths (top-level re-export vs. submodule).
  • Evaluator Aggregation: The multi-metric judge evaluator's outputs get averaged, which can mask critical safety failures. Worth a deliberate design decision on the aggregation semantics.
  • Reproducibility: No seed parameter for case generation makes CI/CD regression testing non-deterministic.

The separation of "what to attack" (presets) from "how to attack" (strategies) is a clean design that should scale well as more strategies land in the follow-up PRs.

@github-actions

Copy link
Copy Markdown

Review Summary (Round 4)

Assessment: Request Changes

All Round 3 items were addressed well. However, a correctness issue remains with the shared target Agent state across cases.

Review Details
  • Correctness (Critical): The target Agent's messages accumulate across all red team cases with no reset. This breaks case isolation — later cases see earlier attack conversations, and the context window will overflow on larger runs. Each case should evaluate the target independently.
  • Concurrency safety: run_evaluations_async inherits max_workers=10 from the base class, but the task function captures a shared mutable Agent. Parallel execution would corrupt state. Default should be max_workers=1 for the async path too.
  • Robustness: _infer_risk_categories doesn't guard against structured_output = None, unlike _generate_cases_for_category which does.

The first issue (agent state isolation) is the only blocker — it affects correctness of all multi-case runs. The other two are defensive improvements.

Comment thread src/strands_evals/experimental/redteam/generators/adversarial.py
Comment thread src/strands_evals/experimental/redteam/evaluators/attack_success_evaluator.py Outdated
Comment thread src/strands_evals/experimental/redteam/generators/adversarial.py Outdated
Comment thread src/strands_evals/experimental/redteam/generators/adversarial.py Outdated
Comment thread src/strands_evals/experimental/redteam/generators/adversarial.py
Comment thread src/strands_evals/experimental/redteam/experiment.py Outdated
Comment thread src/strands_evals/experimental/redteam/case.py
Comment thread src/strands_evals/experimental/redteam/experiment.py
@kevmyung kevmyung force-pushed the feat/red-team-foundation branch from c3d82fb to 16aa4b0 Compare May 22, 2026 16:13
@kevmyung kevmyung temporarily deployed to manual-approval May 22, 2026 16:13 — with GitHub Actions Inactive
@poshinchen

Copy link
Copy Markdown
Contributor

Also, does the experiment return list[RedTeamReport] or just single RedTeamReport?
Customers can pass multiple evaluators given an experiment and it'll generate list of report. The Base Report class has flatten method, user can just use that.

@kevmyung

Copy link
Copy Markdown
Contributor Author

Also, does the experiment return list[RedTeamReport] or just single RedTeamReport? Customers can pass multiple evaluators given an experiment and it'll generate list of report. The Base Report class has flatten method, user can just use that.

Returns a single RedTeamReport. We collect list[EvaluationReport] from the base call internally and merge them case-keyed - needed for case-centric views (failed_cases, by_risk_category(), display()). Happy to switch to base flatten.

Comment thread src/strands_evals/experimental/redteam/report.py Outdated
Comment thread src/strands_evals/experimental/redteam/task.py
@github-actions

Copy link
Copy Markdown

Review Summary (Round 5)

Assessment: Comment (Approve with minor fixes)

All critical and important issues from Round 4 (agent state isolation, async max_workers, None guard) have been properly addressed. The module is in good shape.

Remaining Items
  • Robustness: assert for data validation in report.py:69 will be stripped by Python -O, use explicit if/raise instead.
  • Style: Log messages don't follow the repo's STYLE_GUIDE.md format (field=<%s> | message). Consistent across all files in the module.

Neither item is blocking. The architecture is clean, test coverage is solid (7 test files covering all major components), and the layered design properly reuses existing framework primitives.

Adds an experimental red-teaming module under src/strands_evals/experimental/redteam/
that extends Strands Evals base types (Case, Experiment, Evaluator, ActorSimulator)
with adversarial counterparts.

- AdversarialCaseGenerator: generates RedTeamCases per risk category, with
  optional auto-inference of categories from target tools/system_prompt
- RedTeamExperiment: orchestrates multi-turn attacker/target conversations
- AttackSuccessEvaluator: continuous 0.0-1.0 LLM-as-judge over conversation +
  tool execution traces
- AdversarialActorSimulator: ActorSimulator subclass shared across strategies
- AttackStrategy + PromptStrategy with gradual_escalation as the default
@kevmyung kevmyung force-pushed the feat/red-team-foundation branch from 16aa4b0 to 1ecc290 Compare May 22, 2026 18:01
@kevmyung kevmyung temporarily deployed to manual-approval May 22, 2026 18:01 — with GitHub Actions Inactive
@kevmyung kevmyung temporarily deployed to manual-approval May 22, 2026 18:01 — with GitHub Actions Inactive
Comment thread src/strands_evals/experimental/redteam/task.py
Comment thread src/strands_evals/experimental/redteam/task.py
Comment thread src/strands_evals/experimental/redteam/task.py
@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

Review Summary

Assessment: Approve (with minor suggestions)

The module has matured significantly through 5+ prior review rounds. All critical issues from earlier rounds (agent state isolation, concurrency safety, None guards, assert-for-validation) are resolved. The architecture cleanly composes existing framework primitives and the test coverage is thorough (762 lines of tests across 7 test files for 1198 lines of source).

Remaining Suggestions
  • Style: Log format doesn't follow STYLE_GUIDE.md pattern (field=<%s> | message) — 8 calls across task.py and generators/adversarial.py.
  • Defensive coding: No post-strategy.enhance() empty check — future algorithmic strategies could return empty and waste a turn.
  • Test coverage: Dict-returning callable target path in _call_target lacks direct unit test coverage.

None of these are blocking. The experimental/ namespace properly signals API instability, and the design (Generator → Experiment → Task → Evaluator → Report) is clean and extensible.

@jjbuck jjbuck left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with just a few non-blocking nits noted for eventual transition from experimental to main.

@poshinchen poshinchen left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's iterate the action items in the follow up PRs

@poshinchen poshinchen merged commit 364e132 into strands-agents:main Jun 2, 2026
15 checks passed
poshinchen pushed a commit that referenced this pull request Jun 9, 2026
* fix(redteam): align log format and cover dict-target path

Carry-over nits from PR #184:
- Align 8 log calls in task.py and generators/adversarial.py to the
  project's field=<%s> | message convention (no punctuation/capitals).
- Add unit tests for the _call_target dict-target branch (with and
  without a trace key), which was previously untested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(redteam): unify strategies on run_attack with strategy-agnostic cases

Every AttackStrategy now owns its multi-turn loop via an abstract
run_attack(case, call_target, ...) -> AttackRunResult; the task runner
injects call_target (target invocation + tool-trace capture + per-case
messages.clear isolation) and no longer branches on strategy type.

Why: a single execution model (strategy owns its loop) is simpler than a
runner-owned loop plus a per-strategy exception. Cases become
strategy-agnostic (no strategy/template baked into RedTeamConfig); the
RedTeamExperiment holds the strategy instances and expands the
case x strategy cross-product at run time, so hand-crafted cases and
strategy comparison (by label) are both first-class.

- base.py: run_attack @AbstractMethod + AttackRunResult dataclass; add
  label (instance id, defaults to name); remove the unused enhance().
- PromptStrategy: relocate the ActorSimulator loop from task.py into
  run_attack (gradual_escalation behavior unchanged).
- RedTeamConfig: drop strategy/system_prompt_template + their validator.
- generators/adversarial: generate_cases emits strategy-agnostic cases;
  rename target -> agent; drop attack_strategies.
- experiment: rename target -> agent; accept attack_strategies; build
  _by_label (duplicate label -> ValueError); expand cross-product before
  delegating to the base worker (left untouched).
- task: build call_target, look up the case's strategy by label, map
  AttackRunResult to the {"output", "trajectory", ...} dict.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(redteam): add Crescendo multi-turn attack strategy

CrescendoStrategy escalates gradually across turns, each attacker message
building on the target's previous answer. On a refusal it backtracks by
simply not appending the refused (question, response) pair and retrying with
a fresh question (up to max_backtracks), so the refused turn never enters the
history — a simpler equivalent of PyRIT's excluding-last-turn approach. It
stops early once a turn scores at/above success_threshold.

The refusal/success/question-generation helpers are module-level functions
(is_refusal, success_score, gen_escalating_question) rather than methods, so
future strategies (PAIR, TAP) can reuse them without importing a strategy
class. They power the strategy's cheap in-loop "should I stop?" gate;
success_score reads the case's success_criteria — the same input the
authoritative AttackSuccessEvaluator uses — so the two never disagree on what
counts as success, while the evaluator remains the sole verdict over the full
trace. Parse failures degrade safely (question -> terminate preserving the
conversation; judge -> score 0 and keep looping); only the evaluator raises.

The attacker model resolves to the ctor model first, then the experiment
model. CrescendoStrategy is exported but intentionally NOT in
BUILTIN_STRATEGIES (it is user-instantiated with params).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(redteam): add per-failure drill-down to the report

The aggregate sections (top-line attack-success rate, by_risk_category,
by_strategy when more than one strategy ran) are unchanged. Each failure
line now also shows the attacker's objective and the strategy's per-run
stats (turns used, backtracks) so a multi-turn result like Crescendo is
legible at a glance, not just a single score.

The strategy's run metadata reaches the report by merging
AttackRunResult.metadata onto the case metadata in the task function; the
base Experiment shares that dict with the EvaluationData it builds, so no
base change is needed. Full turn-by-turn conversation output is left for a
future verbose mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(redteam): end-to-end wiring tests + fix strategy metadata join

Cover both user paths through the full locked interface with only the LLM
layer mocked (attacker, in-loop judge, and the evaluator's judge agent):
- generated cases: generate_cases(agent=...) -> RedTeamExperiment with
  CrescendoStrategy -> run_evaluations -> RedTeamReport.
- hand-crafted cases: the same pipeline from RedTeamCase objects built by
  hand, skipping the generator (Model B's first-class path).

Live (real-Bedrock) runs surfaced a wiring bug these mock tests now guard:
the strategy's run metadata (turns_used, backtracks) never reached the
report. task_fn mutated case.metadata, but Pydantic copies that dict into a
fresh EvaluationData, and the base Experiment doesn't carry task-returned
metadata anyway. Fix: the experiment now collects each case's run metadata
(keyed by case name) and joins it onto the report in
RedTeamReport.from_evaluation_reports — keeping the base untouched and the
collection logic on the RedTeamExperiment layer (where it stays put if the
experiment later stops extending the base).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(redteam): address pre-merge review (idempotency, refusal accuracy, leaks)

Adversarial self-review before opening the PR surfaced two correctness bugs
and several maintainability issues; fixing them here.

Correctness:
- Cross-product expansion was mutating self._cases in place, so re-running an
  experiment squared it (c0__cre -> c0__cre__cre). _expand_cross_product is now
  pure (returns a new list) and run_evaluations_async swaps/restores self._cases
  around the base run, making reruns idempotent.
- is_refusal flagged compliant text containing refusal substrings ("I cannot
  stress enough... here are the steps", "I apologize, here is..."), dropping
  successful attacks from the trace and biasing results toward "attack failed".
  Markers are now only a cheap negative prefilter; on a marker hit a refusal
  judge (the previously-unused REFUSAL_JUDGE_SYSTEM_PROMPT) disambiguates, with
  a safe "keep the turn" fallback on parse failure.

Maintainability:
- Removed the leaky AttackRunResult.trajectory field (the task owns the trace
  via call_target); task_fn now assembles the output/trajectory payload directly.
- Unified turns_used to "turns kept in the conversation" across strategies;
  Crescendo additionally reports target_calls (incl. refused, backtracked calls).
- Documented max_turns as an experiment-level ceiling (strategy runs min of the
  two), the no-success_criteria behavior, and the max_workers=1 requirement;
  run_evaluations_async now rejects max_workers != 1 instead of relying on a comment.
- Dropped the now-unused resolve_strategy/DEFAULT_STRATEGY public surface.

Tests: idempotency, refusal false-positives + judge disambiguation, all-refusal
empty conversation, ctor-vs-injected max_turns both directions, no-criteria run,
direct async entry + coroutine/max_workers guards; e2e now asserts exact
turns_used/backtracks with an engaging (non-refusal) target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(redteam): add report.display(verbose=True) to show failure conversations

An LLM judge can be fooled by a target that *claims* to leak — e.g. a target
that, under escalation, emits a code block it presents as "my system prompt"
which may be partly hallucinated. The aggregate report can't be verified by
eye without the transcript.

display(verbose=True) now prints each failed case's full attacker/target
conversation (default stays the compact aggregate + one-line drill-down), so a
user can confirm whether a flagged "success" is a real leak or a false positive.
The conversation is carried on AttackResult.conversation (from the case's
actual_output).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(redteam): strategies own their turn budget; drop experiment max_turns

The experiment's max_turns (default 10) silently capped every strategy via
min(strategy_max_turns, experiment_max_turns), so CrescendoStrategy(max_turns=30)
under the default experiment ran only 10 turns — quietly breaking the
compare-same-strategy-different-params use case.

Each strategy now owns its turn budget; the task passes MAX_ALLOWED_TURNS (50)
as a hard ceiling, so turn_cap = min(strategy.max_turns, 50). Removed max_turns
from RedTeamExperiment.__init__ entirely. Added max_turns to PromptStrategy so
gradual_escalation keeps its prior default of 10 (and its {max_turns} prompt
text) rather than jumping to the ceiling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(redteam): second adversarial pass + address reviewer bot feedback

- judges score each response statelessly (clear history per call) so
  earlier turns don't bias the in-loop refusal/success verdicts
- correct backtrack docstring: it is report-scope only, the target's
  own context is not rolled back; add a proof test
- drop dead keys from task_fn return dict (base reads only output/trajectory)
- export AttackRunResult publicly (part of the strategy extension contract)
- remove unused system_prompt_template from base AttackStrategy
- fix log-statement separators; extract dense metadata merge into locals
- add hardening cross-ref comments (lazy-init attacker, _cases swap)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(redteam): add TargetSession protocol and implementations

Introduce TargetSession (invoke/snapshot/restore/supports_rewind/trace) as the
handle a strategy uses to talk to the target, replacing the opaque call_target
in a follow-up. AgentTargetSession wraps a strands.Agent and is rewindable via
the SDK snapshot API (deep-copy rollback); CallableTargetSession wraps an opaque
callable and reports supports_rewind=False. Bumps strands-agents floor to
>=1.36.0 for the snapshot API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(redteam): replace call_target with TargetSession across strategies

Switch run_attack from call_target: Callable[[str], str] to a TargetSession,
so a strategy can roll the target back via the session's snapshot/restore.
Crescendo now does a real state rollback on a refusal for rewindable (Agent)
targets and degrades to report-scope backtracking for opaque callables; both
keep the refused turn in AttackRunResult.pruned_branches as defended-turn
evidence. The report surfaces that evidence: display() is flattened to a
case x strategy matrix plus a per-attack table (every attack, breached and
defended), closing the gap where a fully-defended run looked empty. Score
aggregation across evaluators switches min -> max (worst-case = strongest
attack). The trace is rolled back alongside messages so backtracked tool calls
no longer ghost into the trajectory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(redteam): address adversarial-review findings on TargetSession + report

- Matrix now pivots on the base case name: the cross-product names work items
  "{case}__{strategy}", so without stripping the suffix every cell landed on its
  own row and the case x strategy grid was meaningless (AR-5).
- Replace the snapshot.app_data["_trace_len"] mutation with an explicit
  TargetCheckpoint(agent_snapshot, trace_len) dataclass returned by snapshot()
  and consumed by restore() — no stashing internal keys on the SDK object, and
  trace/messages roll back together (AR-1/AR-2).
- Move per-case isolation into TargetSession.reset() (clears the wrapped agent's
  history + trace) instead of task.py reaching into agent.messages (AR-7).

Verified live against Bedrock: backtrack still rolls back (backtracks=4,
blocked=4) and the 2x2 cross-product matrix renders one row per case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(redteam): second-pass review cleanups on TargetSession + report

- Remove a duplicated paragraph in the crescendo module docstring.
- Add `from __future__ import annotations` to target_session.py to match every
  sibling module and keep the TargetCheckpoint forward reference safe.
- Export the session companions consistently: TargetCheckpoint joins TargetSession
  on the redteam facade (both are part of the strategy contract, like
  AttackRunResult); the two concrete impls are exported at the strategies package.
- Qualify the backtrack docstring/comment to "the target's state" — the attacker
  agent keeps its own history (a known, separate quality limitation).
- Parameterize trace annotations as list[dict[str, Any]] to match the strategies layer.
- Guard the report matrix against a base-case/strategy key collision: if stripping
  the cross-product suffix would hide a result, fall back to full names.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(redteam): add TargetSession.trim_trace; tidy flat-table case column

- Add trim_trace(length) to the TargetSession protocol so a strategy rolls the
  tool trace back through the session instead of mutating session.trace directly
  (addresses the bot review: the protocol never promised .trace returns a mutable
  reference, so a defensive-copy impl would have silently ghosted refused-turn
  tool calls). AgentTargetSession.restore now reuses trim_trace; Crescendo's
  non-rewindable backtrack calls it instead of `del target_session.trace[...]`.
- Report flat table / transcript header show the base case name (the strategy
  column already disambiguates the cross-product), so the full "{case}__{strategy}"
  name no longer overflows the column into the risk field.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(redteam): drop callable target, harden TargetSession contract

- remove CallableTargetSession; input is Agent | TargetSession (TypeError otherwise)
- rename AgentTargetSession -> StrandsAgentSession
- trace: property -> plain attribute; fold trim into restore()
- split invoke into _send + _tool_uses_in; add ToolUseEntry TypedDict
- crescendo: never backtrack a tool-call turn; stop on it (keep breach evidence)
- resilient trace extraction (placeholder on malformed block keeps the gate honest)

* test(redteam): use a real TargetSession in experiment wiring tests

The lambda agents in test_experiment.py hit the new _build_session TypeError
and passed only because the base experiment catches it as score=0 -- so the
default-task and cross-product wiring was never actually exercised (run_attack
was unreachable). Swap the lambdas for a _FakeSession so the intended paths run.

* fix(redteam): reset target to clean baseline, not just messages

StrandsAgentSession.reset() only cleared messages, but snapshot()/restore()
round-trip the full session preset (messages, state, conversation_manager_state,
interrupt_state). So agent state leaked across cases -- a tool writing
agent.state in case N would still be set in case N+1, which can flip a later
attack's outcome. The experiment now captures one clean baseline at task-build
time (before the first case, while the shared agent is still as-constructed) and
reset() rolls back through the same load_snapshot path restore() uses. Seeded
target history is preserved (it's part of the target definition); per-case state
is cleared.

* test(redteam): pin baseline-reset invariants; tighten _build_session typing

Follow-up to the reset fix after an adversarial review pass:
- type _build_session(baseline) as Snapshot | None instead of Any (it feeds
  load_snapshot, so a non-Snapshot would only surface as a swallowed per-case error)
- add a real-Agent test that one baseline survives repeated resets uncorrupted
  (the capture-once/replay-N aliasing risk), and a test locking the documented
  limitation that a no-baseline session does not isolate non-message state

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE][P0] Red-Teaming: Core Pipeline — multi-turn attack strategies, evaluator, experiment, reporting

4 participants